For this problem you will use a simplified version of the Adult Census Data Set. In the subset provided here, some of the attributes have been removed and some preprocessing has been performed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Read the data into a pandas DataFrame, treating '?' as a missing value
df = pd.read_csv("adult-modified.csv", na_values=['?'])
df.head(10)
a. Preprocessing and data analysis:
*Examine the data for missing values. In the case of categorical attributes, remove instances with missing values. In the case of numeric attributes, impute and fill in the missing values using the attribute mean.
*Examine the characteristics of the attributes, including relevant statistics for each attribute, histograms illustrating the distributions of numeric attributes, bar graphs showing value counts for categorical attributes, etc.
*Perform the following cross-tabulations (including generating bar charts): education+race, work-class+income, work-class+race, and race+income. In the latter case (race+income) also create a table or chart showing percentages of each race category that fall in the low-income group. Discuss your observations from this analysis.
*Compare and contrast the characteristics of the low-income and high-income categories across the different attributes.
df.describe(include='all')
From a first pass over the missing values we see that age, workclass, education, marital-status, race, sex, hours-per-week, and income all have missing items. The categorical attributes that need to be scrubbed of missing values are workclass, marital-status, race, sex, and income.
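A minimal sketch of splitting the columns that contain missing values by dtype, so the numeric ones can be mean-imputed and the categorical ones dropped. It runs on a small toy frame standing in for `df` (the column subset here is an assumption, not the full dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df with one missing value per column type
toy = pd.DataFrame({
    "age": [25, np.nan, 40],
    "workclass": ["Private", None, "State-gov"],
    "hours-per-week": [40, 38, np.nan],
})

# Columns that contain at least one missing value
missing = toy.columns[toy.isnull().any()]
# Split them by dtype: numeric -> impute, categorical -> drop rows
numeric = [c for c in missing if pd.api.types.is_numeric_dtype(toy[c])]
categorical = [c for c in missing if c not in numeric]
print(numeric, categorical)  # ['age', 'hours-per-week'] ['workclass']
```

The same two lists can then drive `fillna` on the numeric side and `dropna(subset=...)` on the categorical side.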
df.columns
df["age"].plot(kind="hist", bins=10)
df['workclass'].value_counts().plot(kind='bar', color='red')
df['education'].value_counts().plot(kind='bar', color='blue')
df['marital-status'].value_counts().plot(kind='bar', color='green')
df['race'].value_counts().plot(kind='bar', color='orange')
df['hours-per-week'].plot(kind="hist", bins=10)
df['income'].value_counts().plot(kind='bar', color='purple')
df.isnull().sum()
df.shape
df['workclass'].value_counts()
# Impute missing age values with the attribute mean
age_mean = df['age'].mean()
df['age'] = df['age'].fillna(age_mean)
# Drop the 588 records with missing workclass values
df = df.dropna(subset=['workclass'])
df.shape
df.describe()
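If categorical columns besides workclass also contained missing values, the same policy could be swept across all of them at once. A sketch on a toy frame (not the assignment's actual data): mean-impute every numeric column, then drop rows missing any categorical value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: one missing numeric value, one missing categorical
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "workclass": ["Private", None, "State-gov", "Private"],
})

# Mean-impute every numeric column
for col in toy.select_dtypes(include="number").columns:
    toy[col] = toy[col].fillna(toy[col].mean())

# Drop rows with a missing value in any categorical (object) column
cat_cols = toy.select_dtypes(include="object").columns
toy = toy.dropna(subset=cat_cols)
print(len(toy))  # 3 rows remain; the row with a missing workclass is gone
```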
*Perform the following cross-tabulations (including generating bar charts): education+race, work-class+income, work-class+race, and race+income. In the latter case (race+income) also create a table or chart showing percentages of each race category that fall in the low-income group. Discuss your observations from this analysis.
gg = pd.crosstab(df["education"],df["race"])
gg
gg.plot(kind="bar",figsize=(10,7))
gg = pd.crosstab(df["workclass"],df["income"])
gg
gg.plot(kind="bar",figsize=(10,7))
gg = pd.crosstab(df["workclass"],df["race"])
gg
gg.plot(kind="bar",figsize=(10,7))
gg = pd.crosstab(df["race"],df["income"])
gg
gg.plot(kind="bar",figsize=(10,7))
# Percentage breakdown of race within each income group
gg = pd.crosstab(df["race"], df["income"], normalize="columns") * 100
gg
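The table above normalizes within each income column. The assignment's other reading, the percentage of each race category that falls in the low-income group, normalizes within each row instead (`normalize='index'`). A sketch on toy data standing in for `df`:

```python
import pandas as pd

# Toy stand-in for df: 3 White rows (2 low-income), 2 Black rows (both low-income)
toy = pd.DataFrame({
    "race": ["White", "White", "Black", "White", "Black"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

# Percent of each race (row) falling in each income group
pct = pd.crosstab(toy["race"], toy["income"], normalize="index") * 100
print(pct)
```

Here each row sums to 100, so `pct["<=50K"]` directly answers "what share of this race is low-income".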
From the bar graph and table it is clear that the majority of the population in this data set is White. Whites also account for 83% of those making less than 50K, followed by Blacks at 11% and Hispanics at about 1%. The higher-than-50K income group likewise consists mostly of White individuals, while the smallest share belongs to the Amer-Indian category at about 0.4%.
b. Predictive Modeling and Model Evaluation:
*Using either Pandas or Scikit-learn, create dummy variables for the categorical attributes. Then separate the target attribute ("income>50K") from the attributes used for training. [Note: you need to drop "income<=50K", which is also created as a dummy variable in earlier steps.]
*Use scikit-learn to build classifiers using Naive Bayes (Gaussian), decision tree (using "entropy" as the selection criterion), and linear discriminant analysis (LDA). For each of these perform 10-fold cross-validation (using the cross-validation module in scikit-learn) and report the overall average accuracy.
*For the decision tree model (generated on the full training data), generate a visualization of the tree and submit it as a separate file (png, jpg, or pdf) or embed it in the Jupyter Notebook.
df_modified = pd.get_dummies(df)
df_modified.head()
#separate the target attribute ("income_>50K")
df_target = df_modified['income_>50K']
df_target.head(10)
#drop "income_<=50K" which is also created as a dummy variable in earlier steps)
df_new = df_modified.drop(['income_<=50K','income_>50K'], axis=1)
df_new.head()
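A quick sanity check that no income-derived dummy column leaks into the feature matrix after the drop. Sketched on a two-row toy frame rather than the real data:

```python
import pandas as pd

# Toy stand-in for df before dummy-coding
toy = pd.DataFrame({"income": ["<=50K", ">50K"], "sex": ["Male", "Female"]})

dummies = pd.get_dummies(toy)
target = dummies["income_>50K"]
features = dummies.drop(columns=["income_<=50K", "income_>50K"])

# No column derived from the target should remain among the features
leaked = [c for c in features.columns if c.startswith("income")]
print(leaked)  # []
```

Training on a feature matrix that still contains `income_<=50K` would let the model trivially recover the target, so this check is cheap insurance.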
Use scikit-learn to build classifiers using Naive Bayes (Gaussian), decision tree (using "entropy" as the selection criterion), and linear discriminant analysis (LDA). For each of these perform 10-fold cross-validation (using the cross-validation module in scikit-learn) and report the overall average accuracy.
from sklearn import tree, naive_bayes
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import model_selection
Naive Bayes (Gaussian)
nbclf = naive_bayes.GaussianNB()
nbclf = nbclf.fit(df_new, df_target)
cv_scores = model_selection.cross_val_score(nbclf, df_new, df_target, cv=10)
cv_scores
print("Overall Accuracy on cross-validation is: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std() * 2))
Decision tree (using "entropy" as selection criteria)
treeclf = tree.DecisionTreeClassifier(criterion='entropy', min_samples_split=3)
treeclf = treeclf.fit(df_new, df_target)
cv_scores = model_selection.cross_val_score(treeclf, df_new, df_target, cv=10)
cv_scores
print("Overall Accuracy on cross-validation is: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std() * 2))
Linear Discriminant Analysis (LDA)
ldclf = LinearDiscriminantAnalysis()
ldclf = ldclf.fit(df_new, df_target)
cv_scores = model_selection.cross_val_score(ldclf, df_new, df_target, cv=10)
cv_scores
print("Overall Accuracy on cross-validation is: %0.2f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std() * 2))
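The three cross-validation runs above can also be consolidated into a single loop over the models. A sketch on synthetic data from `make_classification`, standing in for `df_new`/`df_target` (the shapes and scores here are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dummy-coded census features and income target
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(
        criterion="entropy", min_samples_split=3, random_state=0
    ),
    "LDA": LinearDiscriminantAnalysis(),
}

results = {}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=10)
    results[name] = scores.mean()
    print("%s: %0.2f (+/- %0.2f)" % (name, scores.mean(), scores.std() * 2))
```

Keeping the mean accuracies in one dict makes the side-by-side comparison the assignment asks for a single `print` away.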
For the decision tree model (generated on the full training data), generate a visualization of the tree and submit it as a separate file (png, jpg, or pdf) or embed it in the Jupyter Notebook.
import graphviz
from sklearn.tree import export_graphviz
from IPython.display import Image
treeclf = tree.DecisionTreeClassifier(criterion='entropy', min_samples_split=3)
treeclf = treeclf.fit(df_new, df_target)
export_graphviz(treeclf,out_file='tree.dot', feature_names=df_new.columns )
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
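If the Graphviz system binaries are not installed, `sklearn.tree.plot_tree` (available since scikit-learn 0.21) renders the tree with matplotlib alone. A sketch on the iris data rather than the census features, saving straight to a png for submission:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the figure can be saved to a file
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Small stand-in dataset; max_depth keeps the rendered tree legible
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

fig, ax = plt.subplots(figsize=(12, 8))
plot_tree(clf, filled=True, ax=ax)
fig.savefig("tree.png")
```

On the real model, passing `feature_names=df_new.columns` to `plot_tree` labels the splits the same way `export_graphviz` does, with no `tree.dot` intermediate.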